
Conversation

@ben11211 commented Sep 11, 2025

Allows for graceful shutdown of the IsolateServicer and associated agents.

Adds the ISOLATE_SHUTDOWN_GRACE_PERIOD environment variable. Defaults to 60m.
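For illustration, the variable could be read like this, assuming the value is interpreted as seconds so that the 60-minute default corresponds to 3600 (the actual parsing may differ):

import os

# Assumed parsing: seconds, defaulting to 60 minutes.
SHUTDOWN_GRACE_PERIOD = float(os.getenv("ISOLATE_SHUTDOWN_GRACE_PERIOD", "3600"))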

Prior Behavior

  1. The IsolateServicer's client gRPC connection closes, but this does not force an exit.
  2. The termination interceptor is called, which stops the server but does not terminate agent tasks. Agent tasks exit on their own schedule, and then main returns.

Proposed Behavior

  1. IsolateServicer is initialized with signal handlers.
  2. A shutdown event is triggered via either of the following:
    • Signal handlers receive SIGTERM or SIGINT
    • IsolateServicer's gRPC client is no longer active during single-run mode
  3. servicer.initiate_shutdown() is called.
  4. The servicer calls agent.Terminate() for all of the servicer's active background_tasks.
  5. agent.Terminate() sends SIGTERM to the agent process (by terminating self._bound_context), then SIGKILL after the configured grace period. Steps 3-5 are sketched below.
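In code, that shutdown path amounts to roughly the following (a sketch using the names from this description, not the exact implementation):

def on_shutdown(servicer) -> None:
    servicer.initiate_shutdown()                # step 3
    for task in servicer.background_tasks.values():
        if task.agent is not None:
            # steps 4-5: SIGTERM now, SIGKILL after the grace period
            task.agent.Terminate()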

View client termination behavior before and after this change:

shutdownbehavior.mov

⚠️ Note ⚠️

Grace period handling is not complete, because agent signal handling cannot be propagated completely. When the agent subprocess receives a termination signal, the signal can only be received by the main execution thread of the Python application, which in this case is src/isolate/connections/grpc/agent.py:run_agent, a running gRPC server. Agent gRPC calls are handled by worker threads, and the Run handler calls the provided function within the context of the handler thread. This means any provided function executed within an isolate agent is not run by a thread capable of receiving signals.
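This constraint is easy to demonstrate standalone (this is not project code): signal.signal() raises ValueError when called off the main thread.

import signal
import threading


def user_function():
    # Mirrors what a provided function inside the agent would hit:
    # handler registration is only allowed on the main thread.
    try:
        signal.signal(signal.SIGTERM, lambda signum, frame: None)
    except ValueError as exc:
        print(f"cannot register handler: {exc}")


worker = threading.Thread(target=user_function)
worker.start()
worker.join()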

This means that within the current multithreaded agent approach, providing graceful shutdown capabilities to isolated Python functions requires an application-layer signal, such as a passed context. A function running within an isolate agent must be context-aware if it wants to implement graceful shutdown.
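A sketch of that context-aware pattern (the event wiring here is assumed for illustration and is not part of this PR):

import threading
import time


def user_function(shutdown_event: threading.Event) -> None:
    # Poll an application-layer signal between units of work instead of
    # relying on OS signals, which never reach this worker thread.
    while not shutdown_event.is_set():
        time.sleep(0.1)  # stand-in for one unit of work
    print("cleaning up before exit")


shutdown = threading.Event()
worker = threading.Thread(target=user_function, args=(shutdown,))
worker.start()
shutdown.set()  # the agent would set this when Terminate() arrives
worker.join()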

Alternatively, the agent gRPC server could be modified to use asyncio instead of the current multithreaded approach. This would not incur a performance hit, since the current approach is configured for a single worker thread. Isolate functions could then register signal handlers as they normally would, without needing to be context-aware.

Here's a rough sketch of what that would look like, with a test confirming that signal propagation through to execute_function does work.
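For reference, a minimal sketch of that asyncio variant (module wiring and servicer registration are elided, and execute_function is stubbed; treat this as an illustration rather than the actual patch):

import asyncio
import signal

import grpc


def execute_function(request):
    """Stand-in for the real execute_function in agent.py."""
    yield f"ran {request!r} on the main thread"


class AgentServicer:
    # Would implement the generated agent servicer interface.
    async def Run(self, request, context):
        # The handler runs on the event loop (main) thread, so user code
        # reached via execute_function may call signal.signal() legally.
        for result in execute_function(request):
            yield result


async def run_agent(address: str) -> None:
    server = grpc.aio.server()
    # register AgentServicer with the server here (generated helper elided)
    server.add_insecure_port(address)
    await server.start()

    loop = asyncio.get_running_loop()
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(
            sig, lambda: asyncio.ensure_future(server.stop(grace=5.0))
        )
    await server.wait_for_termination()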

@ben11211 changed the title from "feat(server): add graceful shutdown with signal handling and client d…" to "feat: add graceful shutdown with signal handling" on Sep 11, 2025
@ben11211 force-pushed the server-agent-graceful-shutdown branch 5 times, most recently from c3cb5ef to b86f35d on September 12, 2025 07:45
@ben11211 marked this pull request as ready for review on September 12, 2025 07:47
@ben11211 force-pushed the server-agent-graceful-shutdown branch from b86f35d to 3ac8323 on September 12, 2025 20:40
@ben11211 force-pushed the server-agent-graceful-shutdown branch 2 times, most recently from e1ccbb6 to db01ee6 on September 12, 2025 21:17
@ben11211 force-pushed the server-agent-graceful-shutdown branch 2 times, most recently from 021733c to 4abc42f on September 12, 2025 21:34
@ben11211 force-pushed the server-agent-graceful-shutdown branch 2 times, most recently from cdb62ca to 7512c1e on September 12, 2025 22:22

- server.add_insecure_port("[::]:50001")
  print("Started listening at localhost:50001")
+ server.add_insecure_port(f"[::]:{options.port}")
Author

This change was necessary to support running end-to-end tests concurrently.

@ben11211 force-pushed the server-agent-graceful-shutdown branch 2 times, most recently from 20c13d2 to 3cfa041 on September 12, 2025 22:34
@ben11211 force-pushed the server-agent-graceful-shutdown branch from 3cfa041 to 490ff0a on September 12, 2025 23:09
@ben11211 force-pushed the server-agent-graceful-shutdown branch from 490ff0a to 57a2fc2 on September 12, 2025 23:12
Comment on lines +142 to +149
try:
    print(f"Terminating agent PID {proc.pid}")
    proc.terminate()
    proc.wait(timeout=shutdown_grace_period)
except subprocess.TimeoutExpired:
    # Process didn't die within timeout
    print(f"killing agent PID {proc.pid}")
    proc.kill()
Member

looking good

Comment on lines 510 to 515
def initiate_shutdown(self, grace_period: float | None = None) -> None:
    if self._shutting_down:
        return
    self._shutting_down = True
    if grace_period is None:
        grace_period = SHUTDOWN_GRACE_PERIOD
Member

Accepting grace_period as None reads as if it were the "infinite wait" option, but in reality you are assigning a default of SHUTDOWN_GRACE_PERIOD.

I would change grace_period to default directly to SHUTDOWN_GRACE_PERIOD in the function params
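i.e. something like:

def initiate_shutdown(self, grace_period: float = SHUTDOWN_GRACE_PERIOD) -> None:
    ...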

Author

Instead, I've removed it from the function parameters. I realized it was left over from tests in a previous revision, when the terminate -> kill timing was managed within initiate_shutdown. Now this function only outputs a useful log with this value.

SHUTDOWN_GRACE_PERIOD is now passed through to the LocalPythonGRPC class during agent allocation, and it controls the terminate -> kill timing there instead.

# Collect all active agents from running tasks
shutdown_threads = []
for task in self.background_tasks.values():
    if task.agent is not None:
Member

and if it is None you should probably cancel it directly


if self._server:
    print("Stopping gRPC server")
    self._server.stop(grace=0.1)  # Short grace period for server shutdown
Member

I think bringing the server into the servicer context has no real benefit; you can call initiate_shutdown() and then call server.stop() right after, in the SingleTaskInterceptor context

Comment on lines 498 to 508
def register_signal_handlers(self, server: grpc.Server) -> None:
    """Set up signal handlers for graceful shutdown"""
    self._server = server

    def signal_handler(signum, _):
        """Handle SIGTERM and SIGINT by gracefully shutting down server"""
        print(f"Received signal {signum}, shutting down server")
        self.initiate_shutdown()

    signal.signal(signal.SIGTERM, signal_handler)
    signal.signal(signal.SIGINT, signal_handler)
Member

And then you would need to move this function out of this context too, just to a context that has access to both the server and the servicer

Member

you can just add, right before the main function, something like:

def signal_shutdown(signum, _):
    print(f"Received signal {signum}, shutting down server")
    servicer.initiate_shutdown()
    server.stop(grace=0.1)  # grpc.Server.stop() requires a grace argument

def main():
    # ...
    signal.signal(signal.SIGTERM, signal_shutdown)
    signal.signal(signal.SIGINT, signal_shutdown)

# Process should be terminated after terminate_proc() returns
assert proc.poll() is not None, "Process should be terminated by SIGTERM"

def test_force_terminate(self):
Member

good one

def function_obj():
    import time

    time.sleep(10)
Member

should this be an infinite loop instead? to make sure the test is not just taking 10 secs to finish, but is actually closing because of the disconnect
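e.g. something like:

def function_obj():
    import time

    # never returns on its own, so the test only passes if the agent is
    # actually terminated by the disconnect
    while True:
        time.sleep(1)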

@ben11211 force-pushed the server-agent-graceful-shutdown branch from 382aa46 to c132ef5 on September 13, 2025 01:16
@ben11211 force-pushed the server-agent-graceful-shutdown branch from c132ef5 to ccc811d on September 13, 2025 04:11
Comment on lines +20 to +33
def create_run_request(func, stream_logs=True):
    """Convert a Python function into a BoundFunction request for stub.Run()."""
    bound_function = functools.partial(func)
    serialized_function = to_serialized_object(bound_function, method="cloudpickle")

    env_def = EnvironmentDefinition()
    env_def.kind = "local"

    request = BoundFunction()
    request.function.CopyFrom(serialized_function)
    request.environments.append(env_def)
    request.stream_logs = stream_logs

    return request
Author

@ben11211 Sep 13, 2025

@cmlad Here's a convenience function for passing normal functions to an IsolateServicer.Run call. FYI.
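Hypothetical usage (stub construction omitted; stub is assumed to be an already-connected isolate gRPC stub whose Run streams partial results):

request = create_run_request(lambda: print("hello from the agent"))
for partial_result in stub.Run(request):
    print(partial_result)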

Comment on lines +670 to +677
def register_signal_handlers(servicer: IsolateServicer, server: grpc.Server) -> None:
    def handle_signal(signum, frame):
        print(f"Received signal {signum}, initiating shutdown...")
        servicer.initiate_shutdown()
        server.stop(grace=0.1)

    signal.signal(signal.SIGINT, handle_signal)
    signal.signal(signal.SIGTERM, handle_signal)
Author

@chamini2 Is this what you're suggesting here #174 (comment)?
